| Student Name | Student ID |
|---|---|
| Waleed Almutairi | 202011580 |
| Abdulmalik Almadhi | 202026200 |
| Abdullah Alomair | 202032920 |
| Muath Alsubhi | 202027420 |
| Mohammed Aljoudi | 202041460 |

| Column | Description |
|---|---|
| Country | Name of the country |
| Status | Developed or Developing status |
| Life expectancy | Life expectancy in years |
| Adult Mortality | Probability of dying between 15 and 60 years per 1000 population for both sexes |
| Infant deaths | Number of Infant Deaths per 1000 population |
| Alcohol | Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol) |
| Percentage expenditure | Expenditure on health as a percentage of Gross Domestic Product per capita(%) |
| Hepatitis B | Hepatitis B (HepB) immunization coverage among 1-year-olds(%) |
| Measles | Measles - number of reported cases per 1000 population |
| BMI | Average Body Mass Index of entire population |
| Under-five deaths | Number of under-five deaths per 1000 population |
| Polio | Polio (Pol3) immunization coverage among 1-year-olds(%) |
| Total expenditure | Government expenditure on health as a percentage of total government expenditure(%) |
| Diphtheria | Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds(%) |
| GDP | Gross Domestic Product per capita (in USD) |
| Population | Population of the country |
| Thinness 1-19 years | Prevalence of thinness among children and adolescents for Age 10 to 19(%) |
| Thinness 5-9 years | Prevalence of thinness among children for Age 5 to 9(%) |
| Income composition of resources | Human Development Index in terms of income composition of resources (index ranging from 0 to 1) |
| Schooling | Number of years of Schooling(years) |
| Continent | Continent of each country |
# Importing libraries to work with
import warnings
from random import randint, random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings('ignore')
# Raw string so the backslash in the Windows-style path is not read as an escape.
df = pd.read_csv(r"..\Life Expectancy Data.csv")
# As the data is very clean, we will create some inconsistencies.
df["Schooling"] = [-x if randint(0,10) == 3 else x for x in df["Schooling"].values]
df["GDP"] = [-x if randint(0,10) == 3 else x for x in df["GDP"].values]
df["under-five deaths "] = [x+random() for x in df["under-five deaths "].values]
inconsistent_datatype = ["Schooling", "under-five deaths"]
fields = {"Fields":[str(x) for x in df.columns], "Types":[str(df[x].dtype) for x in df.columns],
"Real Type": ["int64" if x in inconsistent_datatype else str(df[x].dtype) for x in df.columns]}
fields_df = pd.DataFrame(data=fields)
display(df.head())
display(fields_df)
| Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | ... | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | -631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
5 rows × 22 columns
| Fields | Types | Real Type | |
|---|---|---|---|
| 0 | Country | object | object |
| 1 | Year | int64 | int64 |
| 2 | Status | object | object |
| 3 | Life expectancy | float64 | float64 |
| 4 | Adult Mortality | float64 | float64 |
| 5 | infant deaths | int64 | int64 |
| 6 | Alcohol | float64 | float64 |
| 7 | percentage expenditure | float64 | float64 |
| 8 | Hepatitis B | float64 | float64 |
| 9 | Measles | int64 | int64 |
| 10 | BMI | float64 | float64 |
| 11 | under-five deaths | float64 | float64 |
| 12 | Polio | float64 | float64 |
| 13 | Total expenditure | float64 | float64 |
| 14 | Diphtheria | float64 | float64 |
| 15 | HIV/AIDS | float64 | float64 |
| 16 | GDP | float64 | float64 |
| 17 | Population | float64 | float64 |
| 18 | thinness 1-19 years | float64 | float64 |
| 19 | thinness 5-9 years | float64 | float64 |
| 20 | Income composition of resources | float64 | float64 |
| 21 | Schooling | float64 | int64 |
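The injected inconsistencies (sign-flipped values and fractional counts) can also be detected programmatically instead of being listed by hand. The following is a minimal sketch on a small made-up frame, not the real dataset; `toy` and `find_inconsistencies` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for the real data (all values are made up for illustration).
toy = pd.DataFrame({
    "Schooling": [10.1, -10.0, 9.9],           # one sign-flipped value
    "GDP": [584.2, 612.7, -631.7],             # one sign-flipped value
    "under-five deaths ": [83.4, 86.0, 89.7],  # counts with fractional noise
    "Alcohol": [0.01, 0.01, 0.01],             # clean column
})

def find_inconsistencies(frame, count_columns):
    """Return {column: reason} for columns with negative or non-integer counts."""
    issues = {}
    for col in frame.columns:
        values = frame[col].dropna()
        if (values < 0).any():
            issues[col] = "negative values"
        elif col in count_columns and (values % 1 != 0).any():
            issues[col] = "non-integer counts"
    return issues

issues = find_inconsistencies(toy, count_columns=["under-five deaths "])
print(issues)
```

Which columns count as integer-valued is passed in explicitly, because that is a property of the data dictionary rather than something the values alone can prove.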
# Adding Extra Categorical Column named Continent
europe = ['Albania', 'Austria', 'Belgium', 'Bulgaria', 'Belarus', 'Germany', 'Denmark', 'Estonia', 'Finland',
'Greece', 'Hungary', 'Iceland', 'Italy', 'Lithuania', 'Luxembourg', 'Latvia', 'Malta', 'Norway', 'Poland',
'Portugal', 'Romania', 'Sweden', 'Slovenia', 'Slovakia', 'San Marino',
'Ukraine', 'Bosnia and Herzegovina', 'Croatia', 'Monaco', 'Montenegro', 'Serbia', 'Spain', 'Switzerland',
'Czechia', 'Netherlands', 'Republic of Moldova', 'The former Yugoslav republic of Macedonia',
'United Kingdom of Great Britain and Northern Ireland']
africa = ['Angola', 'Burkina Faso', 'Burundi', 'Benin', 'Botswana', 'Congo', 'Cameroon',
'Djibouti', 'Egypt', 'Eritrea', 'Ethiopia', 'Gabon', 'Ghana', 'Guinea', 'Guinea-Bissau', 'Kenya', 'Liberia',
'Libya', 'Madagascar', 'Mali', 'Mauritania', 'Mauritius', 'Malawi', 'Mozambique', 'Namibia', 'Niger',
'Rwanda', 'Seychelles', 'Sudan', 'Senegal', 'Somalia', 'Togo', 'Tunisia', 'Uganda', 'Zambia', 'Zimbabwe',
'Algeria', 'Central African Republic', 'Chad', 'Comoros', 'Equatorial Guinea', 'Morocco', 'South Africa',
'Swaziland', 'Cabo Verde', "Côte d'Ivoire", 'Gambia', 'Sao Tome and Principe', 'South Sudan', 'United Republic of Tanzania', 'Democratic Republic of the Congo']
asia = ['Afghanistan', 'Armenia', 'Azerbaijan', 'Bangladesh', 'Bahrain', 'Brunei Darussalam', 'Cyprus', 'Georgia', 'Indonesia', 'Israel',
'Iraq', 'Jordan', 'Japan', 'Kyrgyzstan', 'Kuwait', 'Lebanon', 'Myanmar', 'Mongolia', 'Maldives', 'Malaysia', 'Oman',
'Philippines', 'Qatar', 'Saudi Arabia', 'Singapore', 'Thailand', 'China',
'Tajikistan', 'Turkmenistan', 'Turkey', 'Uzbekistan', 'Yemen', 'Cambodia', 'Kazakhstan', 'United Arab Emirates', 'Iran (Islamic Republic of)', "Lao People's Democratic Republic", 'Republic of Korea',
"Democratic People's Republic of Korea", 'Russian Federation', 'Syrian Arab Republic', 'Timor-Leste', 'Viet Nam']
north_america = ['Antigua and Barbuda', 'Barbados', 'Bahamas', 'Belize', 'Canada', 'Costa Rica', 'Cuba', 'Dominica',
'Dominican Republic', 'Guatemala', 'Haiti', 'Honduras', 'Jamaica', 'Mexico', 'Nicaragua', 'Panama',
'Trinidad and Tobago', 'El Salvador', 'Grenada', 'Saint Kitts and Nevis', 'Saint Lucia',
'Saint Vincent and the Grenadines', 'United States of America']
south_america = ['Argentina', 'Bolivia (Plurinational State of)', 'Brazil', 'Chile', 'Colombia', 'Ecuador', 'Guyana', 'Peru', 'Paraguay',
'Suriname', 'Uruguay', 'Venezuela (Bolivarian Republic of)']
oceania = ['Fiji', 'Kiribati', 'New Zealand', 'Papua New Guinea', 'Solomon Islands', 'Tonga', 'Vanuatu', 'Samoa', 'Cook Islands', 'Micronesia (Federated States of)', 'Niue']
continent = []
for country in df['Country'].values:
    if country in europe:
        continent.append('Europe')
    elif country in africa:
        continent.append('Africa')
    elif country in asia:
        continent.append('Asia')
    elif country in north_america:
        continent.append('North America')
    elif country in south_america:
        continent.append('South America')
    elif country in oceania:
        continent.append('Oceania')
    else:
        continent.append('Unknown')
df['Continent'] = continent
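A lookup table is a common alternative to a chain of list-membership tests when assigning a category such as the continent. The sketch below is illustrative only: `continent_members` is a hypothetical, heavily shortened version of the country lists, and `Atlantis` is a made-up name used to show the fallback.

```python
# Hypothetical, shortened lookup; the real lists contain many more countries.
continent_members = {
    "Europe": ["Albania", "Austria"],
    "Africa": ["Angola", "Kenya"],
    "Asia": ["Afghanistan", "Japan"],
}

# Invert to country -> continent so each lookup is a single dict access.
country_to_continent = {
    country: continent
    for continent, countries in continent_members.items()
    for country in countries
}

countries = ["Afghanistan", "Albania", "Kenya", "Atlantis"]
# Countries missing from the lookup fall back to 'Unknown'.
continents = [country_to_continent.get(c, "Unknown") for c in countries]
print(continents)
```

On a DataFrame, the same idea can be written as `df['Country'].map(country_to_continent).fillna('Unknown')`. Building the inverted dictionary also makes it easy to spot a country that appears in more than one list, since the later entry silently overwrites the earlier one.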
# List of created inconsistent data.
df.info()
inconsistent_data = ["Schooling", "under-five deaths ", "GDP"]
fields2 = {"Fields": [str(x) for x in df.columns],
           "Inconsistencies": [x in inconsistent_data for x in df.columns],
           "Missing Data": [bool(df[col].isnull().any()) for col in df.columns]}
# Create dataframe to display inconsistent and missing data.
fields2_df = pd.DataFrame(data=fields2)
display(fields2_df)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 23 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   Country                          2938 non-null   object
 1   Year                             2938 non-null   int64
 2   Status                           2938 non-null   object
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64
 10  BMI                              2904 non-null   float64
 11  under-five deaths                2938 non-null   float64
 12  Polio                            2919 non-null   float64
 13  Total expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15  HIV/AIDS                         2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18  thinness 1-19 years              2904 non-null   float64
 19  thinness 5-9 years               2904 non-null   float64
 20  Income composition of resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
 22  Continent                        2938 non-null   object
dtypes: float64(17), int64(3), object(3)
memory usage: 528.0+ KB
| Fields | Inconsistencies | Missing Data | |
|---|---|---|---|
| 0 | Country | False | False |
| 1 | Year | False | False |
| 2 | Status | False | False |
| 3 | Life expectancy | False | True |
| 4 | Adult Mortality | False | True |
| 5 | infant deaths | False | False |
| 6 | Alcohol | False | True |
| 7 | percentage expenditure | False | False |
| 8 | Hepatitis B | False | True |
| 9 | Measles | False | False |
| 10 | BMI | False | True |
| 11 | under-five deaths | True | False |
| 12 | Polio | False | True |
| 13 | Total expenditure | False | True |
| 14 | Diphtheria | False | True |
| 15 | HIV/AIDS | False | False |
| 16 | GDP | True | True |
| 17 | Population | False | True |
| 18 | thinness 1-19 years | False | True |
| 19 | thinness 5-9 years | False | True |
| 20 | Income composition of resources | False | True |
| 21 | Schooling | True | True |
| 22 | Continent | False | False |
# List of columns with NaN values.
null_columns = df.columns[df.isna().any()]
# Impute every column that has NaN values: mean for numeric columns, mode otherwise.
for c in null_columns:
    if df[c].dtype != 'object':
        value = df[c].mean()
    else:
        value = df[c].mode()[0]
    # Assign back instead of calling fillna(inplace=True) on a column selection.
    df[c] = df[c].fillna(value)
# Fix the inconsistent datatype: convert float64 back to int64 (truncates the fractional part).
df['under-five deaths '] = df['under-five deaths '].astype('int64')
# Fix the inconsistent negative values.
df['GDP'] = df['GDP'].abs()
df['Schooling'] = df['Schooling'].abs()
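Mean imputation is sensitive to sign-flipped values, so the order of the cleaning steps can change the imputed value. The following sketch uses made-up numbers (not the real GDP column) to compare imputing before and after correcting the signs; the variable names are hypothetical.

```python
import pandas as pd

# Made-up values; -200.0 stands for a sign-flipped entry, None for a missing one.
gdp = pd.Series([100.0, -200.0, 300.0, None])

# Impute first, then take absolute values: the mean is pulled down by -200.
impute_then_abs = gdp.fillna(gdp.mean()).abs()

# Take absolute values first, then impute: the mean reflects the corrected data.
abs_then_impute = gdp.abs().fillna(gdp.abs().mean())

print(impute_then_abs.tolist())
print(abs_then_impute.tolist())
```

In this toy case the missing entry is filled with roughly 66.7 in the first ordering and with 200.0 in the second, which suggests correcting known sign errors before computing any statistic used for imputation.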
plt.figure(figsize=(15,10))
sns.boxplot(data=df)
plt.xticks(rotation=45)
plt.show()
numeric_columns = df.select_dtypes(exclude='object').columns.drop('Year')
# Scale all numeric data.
scaled_values = StandardScaler().fit_transform(df[numeric_columns])
df2 = pd.DataFrame(scaled_values, columns=numeric_columns)
# Plotting the scaled dataframe
plt.figure(figsize=(15,10))
sns.boxplot(data=df2)
plt.xticks(rotation=45)
plt.show()
# print shape to check initial rows.
print(df2.shape)
# From this graph we decided to make the threshold = 2 to limit most outliers.
threshold = 2
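# Rows to keep: every scaled value lies strictly within the threshold in absolute terms.
selected_rows = (df2.abs() < threshold).all(axis=1)
# Index of the rows to drop: those with at least one value outside the threshold.
selected_index = df[~selected_rows].index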
df2.drop(index=selected_index,inplace=True)
ndf=df.drop(index=selected_index)
ndf.reset_index(inplace = True, drop = True)
# Print shape to check final amount of rows.
print(df2.shape)
plt.figure(figsize=(15,15))
sns.boxplot(data=df2)
plt.xticks(rotation=45)
plt.show()
# Drop the same outlier rows from df so that it matches df2.
df.drop(index=selected_index,inplace=True)
print(df.shape)
(2938, 19)
(1728, 19)
(1728, 23)
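The outlier filter keeps a row only if every standardized value stays inside the threshold. Below is a minimal sketch of that rule on a made-up two-column frame; the data and the threshold of 1.5 are assumptions chosen so that exactly one toy row is dropped, and are not the values used in the analysis.

```python
import pandas as pd

# Made-up data: the last row has an extreme value in column "a".
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 9.0, 10.5, 9.5],
})

# Standardize with the population standard deviation (ddof=0),
# which is what StandardScaler uses.
z = (toy - toy.mean()) / toy.std(ddof=0)

threshold = 1.5  # hypothetical value for this toy example
keep = (z.abs() < threshold).all(axis=1)
filtered = toy[keep].reset_index(drop=True)
print(filtered.shape)
```

Because a row is removed when any single column exceeds the threshold, the share of dropped rows grows with the number of columns. In the run above, a threshold of 2 across 19 columns kept 1728 of 2938 rows.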
# Statistical summary of numeric columns.
display(df.describe().T)
# Statistical summary of categorical columns.
display(df.describe(include='object').T)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year | 1728.0 | 2.008086e+03 | 4.608478e+00 | 2000.00000 | 2004.000000 | 2.008000e+03 | 2.012000e+03 | 2.015000e+03 |
| Life expectancy | 1728.0 | 7.079758e+01 | 7.291746e+00 | 51.00000 | 66.300000 | 7.270000e+01 | 7.540000e+01 | 8.800000e+01 |
| Adult Mortality | 1728.0 | 1.465064e+02 | 9.325697e+01 | 1.00000 | 76.000000 | 1.410000e+02 | 1.990000e+02 | 4.120000e+02 |
| infant deaths | 1728.0 | 1.245428e+01 | 2.559142e+01 | 0.00000 | 0.000000 | 2.000000e+00 | 1.225000e+01 | 2.380000e+02 |
| Alcohol | 1728.0 | 4.439963e+00 | 3.629968e+00 | 0.01000 | 1.007500 | 4.260000e+00 | 7.010000e+00 | 1.243000e+01 |
| percentage expenditure | 1728.0 | 3.425669e+02 | 6.459938e+02 | 0.00000 | 3.984620 | 7.524183e+01 | 3.777792e+02 | 4.506256e+03 |
| Hepatitis B | 1728.0 | 8.809968e+01 | 1.175700e+01 | 36.00000 | 80.940461 | 9.300000e+01 | 9.700000e+01 | 9.900000e+01 |
| Measles | 1728.0 | 7.526817e+02 | 2.607044e+03 | 0.00000 | 0.000000 | 5.000000e+00 | 1.522500e+02 | 2.478900e+04 |
| BMI | 1728.0 | 4.112592e+01 | 1.886290e+01 | 2.00000 | 24.575000 | 4.700000e+01 | 5.630000e+01 | 7.730000e+01 |
| under-five deaths | 1728.0 | 1.691493e+01 | 3.560029e+01 | 0.00000 | 0.000000 | 3.000000e+00 | 1.500000e+01 | 3.310000e+02 |
| Polio | 1728.0 | 8.956056e+01 | 1.197609e+01 | 38.00000 | 85.750000 | 9.500000e+01 | 9.800000e+01 | 9.900000e+01 |
| Total expenditure | 1728.0 | 5.769482e+00 | 1.924121e+00 | 1.15000 | 4.530000 | 5.920000e+00 | 6.900000e+00 | 9.950000e+00 |
| Diphtheria | 1728.0 | 8.939929e+01 | 1.224797e+01 | 36.00000 | 86.000000 | 9.400000e+01 | 9.800000e+01 | 9.900000e+01 |
| HIV/AIDS | 1728.0 | 7.650463e-01 | 1.669615e+00 | 0.10000 | 0.100000 | 1.000000e-01 | 4.000000e-01 | 1.170000e+01 |
| GDP | 1728.0 | 4.740394e+03 | 5.812111e+03 | 1.68135 | 692.573517 | 3.179672e+03 | 5.815971e+03 | 3.281617e+04 |
| Population | 1728.0 | 8.479216e+06 | 1.270535e+07 | 123.00000 | 423820.750000 | 3.526114e+06 | 1.275338e+07 | 1.173189e+08 |
| thinness 1-19 years | 1728.0 | 4.058771e+00 | 2.914535e+00 | 0.10000 | 1.700000 | 3.200000e+00 | 6.325000e+00 | 1.360000e+01 |
| thinness 5-9 years | 1728.0 | 4.079678e+00 | 2.940589e+00 | 0.10000 | 1.700000 | 3.300000e+00 | 6.300000e+00 | 1.370000e+01 |
| Income composition of resources | 1728.0 | 6.711016e-01 | 1.338054e-01 | 0.28600 | 0.600750 | 6.950000e-01 | 7.690000e-01 | 9.480000e-01 |
| Schooling | 1728.0 | 1.229178e+01 | 2.558657e+00 | 5.30000 | 10.500000 | 1.250000e+01 | 1.410000e+01 | 1.840000e+01 |
| count | unique | top | freq | |
|---|---|---|---|---|
| Country | 1728 | 173 | Iran (Islamic Republic of) | 16 |
| Status | 1728 | 2 | Developing | 1466 |
| Continent | 1728 | 6 | Africa | 446 |
numeric_columns = df.select_dtypes(exclude='object').columns
_, axes = plt.subplots(4,5, figsize=(20,20))
for ind, col in enumerate(numeric_columns):
    sns.histplot(x=col,bins=10,kde=True,data=df, ax=axes.flatten()[ind])
plt.xticks(rotation=45)
plt.show()
# Excluding Country as it has a lot of unique values
cat_columns = df.select_dtypes(include='object').columns.drop('Country')
_, axes = plt.subplots(2, 1, figsize=(15,10))
for ind, col in enumerate(cat_columns):
    sns.countplot(y=col, data=df, ax=axes.flatten()[ind])
plt.show()
selected_columns = numeric_columns.drop('Life expectancy ')
_, axes = plt.subplots(4,5, figsize=(20,20))
for ind, col in enumerate(selected_columns):
    sns.scatterplot(x=col, y='Life expectancy ', data=df, ax=axes.flatten()[ind])
plt.show()
selected_columns = numeric_columns.drop('Polio')
_, axes = plt.subplots(4,5, figsize=(20,20))
for ind, col in enumerate(selected_columns):
    sns.scatterplot(x=col, y='Polio', data=df, ax=axes.flatten()[ind])
plt.show()
plt.figure(figsize=(15,10))
sns.countplot(y='Continent', hue='Status', data=df)
plt.show()
selected_columns = numeric_columns.drop('Life expectancy ')
_, axes = plt.subplots(7,3, figsize=(30,30))
for ind, col in enumerate(selected_columns):
    sns.scatterplot(x=col, y='Life expectancy ', hue='Continent', style='Status',
                    data=df, ax=axes.flatten()[ind])
plt.show()
We can observe the following from the histogram in 4.3.2:
We can see in 4.4 that:
- GDP and Percentage expenditure have a moderately positive linear relationship. Furthermore, developed countries have a high GDP and a high percentage expenditure.
- Income composition of resources and Schooling have a strong positive linear relationship. Furthermore, developed countries have higher education levels than developing countries.
- Polio and Diphtheria have a strong positive linear relationship. Furthermore, developed countries tend to have higher immunization coverage for both than developing countries (these columns measure coverage, not case counts).
- Diphtheria and Hepatitis B have a moderately positive linear relationship, with the same pattern of higher coverage in developed countries.
- Polio and Hepatitis B have a moderately positive linear relationship, again with higher coverage in developed countries.
- Schooling and Adult Mortality have a moderately negative linear relationship. Furthermore, developed countries have better levels of education and lower adult mortality rates than developing countries.
- Life expectancy and Schooling have a strong positive linear relationship. Furthermore, developed countries have better levels of education and higher life expectancy than developing countries.
- Life expectancy and Year have a strong positive linear relationship. Furthermore, developed countries have a higher life expectancy over time than developing countries.
- Life expectancy and Adult Mortality have a strong negative linear relationship. In developed countries adult mortality is lower and life expectancy is higher, whereas in developing countries adult mortality is significantly higher and life expectancy significantly lower.
- Life expectancy and Alcohol have a moderately positive linear relationship. Developed countries consume more alcohol and have longer life expectancy, whereas developing countries have both lower alcohol consumption and lower life expectancy. As a result, alcohol appears to have a reduced impact on life expectancy in developed countries.
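Labels such as "strong" or "moderate" can be backed by the Pearson correlation coefficient. The sketch below computes it from scratch on made-up values; `pearson_r` and `describe_strength` are hypothetical helpers, and the cutoffs of 0.7 and 0.4 are an assumed convention rather than values taken from this report.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def describe_strength(r, strong=0.7, moderate=0.4):
    """Label a coefficient; the cutoffs are an assumed convention."""
    size = abs(r)
    if size >= strong:
        label = "strong"
    elif size >= moderate:
        label = "moderate"
    else:
        label = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction}"

# Made-up values: schooling years vs. adult mortality for five hypothetical rows.
schooling = [8.0, 10.0, 12.0, 14.0, 16.0]
adult_mortality = [300.0, 250.0, 180.0, 150.0, 90.0]
r = pearson_r(schooling, adult_mortality)
print(round(r, 3), describe_strength(r))
```

On the full DataFrame, `df[numeric_columns].corr()` returns the same coefficient for every pair of numeric columns at once.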
# This function compares OLS, Ridge, and Lasso and automatically picks the method with the smallest test MSE.
def regressionAnalysis(input_col, output_y):
    # Checks for multiple inputs
    if len(input_col) > 1:
        X = df.loc[:, input_col].values
        print(f'{" and ".join(input_col)} vs {output_y}')
    else:
        X = df[input_col].values.reshape(-1, 1)
    y = df[output_y].values.reshape(-1, 1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    # Standardize inputs and output together, fitting the scaler on the training split only.
    scaler = StandardScaler()
    scaler.fit(np.c_[X_train, y_train])
    A_train = scaler.transform(np.c_[X_train, y_train])
    X_train = A_train[:, :-1]
    y_train = A_train[:, -1]
    A_test = scaler.transform(np.c_[X_test, y_test])
    X_test = A_test[:, :-1]
    y_test = A_test[:, -1]
    # OLS
    reg1 = LinearRegression(fit_intercept=False).fit(X_train, y_train)
    y_pred1 = reg1.predict(X_test)
    mse1 = round(mean_squared_error(y_test, y_pred1), 5)
    print('The MSE using OLS is:', mse1)
    ## RidgeCV Analysis
    reg2 = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3], fit_intercept=False, cv=10).fit(X_train, y_train)
    y_pred2 = reg2.predict(X_test)
    mse2 = round(mean_squared_error(y_test, y_pred2), 5)
    print('The MSE using Ridge is:', mse2)
    ## LassoCV Analysis
    reg3 = LassoCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3],
                   fit_intercept=False, cv=10, random_state=0).fit(X_train, y_train)
    y_pred3 = reg3.predict(X_test)
    mse3 = round(mean_squared_error(y_test, y_pred3), 5)
    print('The MSE using Lasso is:', mse3)
    # Find the smallest MSE across all three methods.
    best_mse = min(mse1, mse2, mse3)
    # Compare by value (==), not identity (is); ties go to the method listed first.
    if best_mse == mse1:
        best_method_name = 'Linear Regression'
        best_method = LinearRegression()
    elif best_mse == mse2:
        best_method_name = 'Ridge Regression'
        best_method = RidgeCV()
    else:
        best_method_name = 'Lasso Regression'
        best_method = LassoCV()
    print(f'The best method with smallest MSE is {best_method_name} with {best_mse}')
    # Plot each input against the output together with the best method's fitted line.
    for col in input_col:
        plt.figure(figsize=(5, 5))
        # linspace also covers columns whose range is narrower than 1,
        # where np.arange with its default step would yield a single point.
        x_values = np.linspace(df[col].min(), df[col].max(), 100).reshape(-1, 1)
        best_method.fit(df[col].values.reshape(-1, 1), df[output_y].values)
        sns.scatterplot(x=col, y=output_y, hue='Continent', style='Status', data=df)
        y_head = best_method.predict(x_values)
        plt.plot(x_values, y_head, color="red")
        plt.title(f'{col} vs {output_y}')
        plt.xlabel(col)
        plt.ylabel(output_y)
        plt.show()
regressionAnalysis(['Polio', 'Hepatitis B'], 'Diphtheria ')
Polio and Hepatitis B vs Diphtheria
The MSE using OLS is: 0.06927
The MSE using Ridge is: 0.06975
The MSE using Lasso is: 0.06928
The best method with smallest MSE is Linear Regression with 0.06927
regressionAnalysis(['Schooling', 'Income composition of resources'], 'Life expectancy ')
Schooling and Income composition of resources vs Life expectancy
The MSE using OLS is: 0.35212
The MSE using Ridge is: 0.35216
The MSE using Lasso is: 0.35217
The best method with smallest MSE is Linear Regression with 0.35212
The three methods perform almost identically in both analyses: OLS has the smallest test MSE (0.06927 and 0.35212), and Ridge and Lasso are within 0.0005 of it. Regularization therefore adds little here, which suggests that the selected inputs are all informative for the output. We interpret this as life expectancy increasing with schooling and income composition of resources, and a similar relationship holds for Diphtheria with Polio and Hepatitis B. Moreover, continent and country status are major factors affecting the output: Africa and other developing regions have less schooling than developed countries, while diphtheria immunization is concentrated in more developed countries and in continents such as Asia and Europe.
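Choosing the method with the smallest MSE can be written as a lookup by value. The sketch below uses the three test MSEs printed for the first analysis (Polio and Hepatitis B vs Diphtheria); the dictionary and variable names are hypothetical.

```python
# Test MSEs as printed for Polio and Hepatitis B vs Diphtheria.
mse_by_method = {"OLS": 0.06927, "Ridge": 0.06975, "Lasso": 0.06928}

# min() with key= returns the method name whose MSE is smallest;
# on a tie it returns the first such name in insertion order.
best_name = min(mse_by_method, key=mse_by_method.get)
best_mse = mse_by_method[best_name]

# Spread between the best and worst method, to judge how much the choice matters.
spread = max(mse_by_method.values()) - min(mse_by_method.values())

print(best_name, best_mse, round(spread, 5))
```

A small spread relative to the MSE itself is a sign that the methods are practically equivalent on this split, so the ranking alone should not carry much weight.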
Our business objective targets insurance companies, which can use our project to predict the life expectancy of a citizen of a given country. With that prediction, an insurance company can estimate the cost of a policy and the potential payout in the unfortunate case of death, and can derive a probability of death from life expectancy. In addition, our data analysis can help such companies optimize the profit from their insurance services.
Understanding our data
We used a dataset from Kaggle, chosen after an extensive search for a dataset adequate for our analysis. The data gives a realistic picture of life expectancy, with a clear structure and good-quality information that supports our project. Within the limitations of this dataset, we were able to give an accurate assessment of predicted future life expectancy.
Our plan
Initially, we applied descriptive and exploratory analysis to our data to identify patterns. Our workflow then depended on how strongly each variable is related to the others and on how all variables contribute to predicting life expectancy.
Tools & Technology
We relied heavily on matplotlib, seaborn, pandas, NumPy, and scikit-learn. In addition, we used linear regression to predict current and future life expectancy from the features.
Collaborative work
Each team member brought an original point of view on the data and presented his opinion. The other members then assessed these views and selected the strongest perspective on the data.
Issues in the implementation of the selected methodology
Many issues can arise in a methodology, and identifying them allows us to improve the methodology and reevaluate our work. Fortunately, such issues were minimal in our endeavour. Problems that could occur include the following:
- A poor approach to the data, such as weak descriptive or exploratory analysis, could lead to a misunderstanding of the data. The most important part of a data scientist's work is understanding the data, which is where we illustrate and emphasize its meaning.
- Difficulty in establishing and connecting with the project's business objectives and goals.
- Challenges in acquiring and preparing high-quality data for the project.
- Use of a poor technique that may not yield a result useful for the purpose of the project. The technology used should also be adequate to offer value to the companies.
- Difficulty integrating the operationalized system with current systems and procedures.
Summary and conclusion of our analysis
Our use of regression analysis yielded several models that can predict certain features from the data.
(Polio and Hepatitis B) vs Diphtheria
First, we used regression to predict a country's Diphtheria vaccination coverage, differentiated by continent to establish how geography, GDP, and other labels contribute to vaccine uptake. The regression model depends on Polio and Hepatitis B vaccination coverage, with continent and country status used to differentiate the results in the plots. We found that these variables have a strong linear dependency, which allows our mechanistic analysis to yield a realistic result. To elaborate, most European countries are developed, and a large proportion are predicted to take all three vaccines. Asian countries differ in status, and we conclude that developed Asian countries are more likely to take the Polio, Hepatitis B, and Diphtheria vaccines. African countries are less likely to take the Polio vaccine but likely to take the Hepatitis B vaccine, and they appear less likely to take the Diphtheria vaccine. Furthermore, North American and South American countries show significant potential for uptake of all three vaccines.
Life expectancy vs (Income composition of resources, Schooling)
Another major aspect of our project is the regression model for life expectancy, which is the main factor in predicting the cost of insurance. The model can also help life insurance companies optimize profit and estimate the expected payout in case of death, based on the data provided. We picked Schooling and Income composition of resources as inputs because they had the highest correlation with life expectancy among the variables, and all three regression methods achieved nearly the same MSE. Schooling is poor in Africa because most African countries are developing countries, which results in a clearly lower life expectancy relative to other regions. Life expectancy is high in both developed and developing European countries, so most European countries are predicted to have a high life expectancy in future years. Most Asian, South American, and North American countries have moderate schooling, which predicts a life expectancy lower than in European countries but higher than in African countries. Finally, developed countries in general are more likely to have a higher predicted life expectancy in future years.
In conclusion, our regression models succeed in predicting both life expectancy and how likely a population is to take the Diphtheria vaccine. European countries, developed and developing, have a predicted low risk of dying, a high life expectancy, a tendency to take the vaccine, and high schooling rates. North American and South American countries have a moderate risk of dying, a moderate life expectancy, and moderate Diphtheria vaccine uptake. African countries have much lower schooling rates than other countries, and most are developing countries; hence their populations have a high risk of dying and a much lower life expectancy, with Diphtheria vaccine uptake predicted where the Polio vaccine is taken. Developed countries in Oceania show moderate life expectancy, low acceptance of the other two vaccines, and low predicted uptake of the Diphtheria vaccine.
Give possible future recommendations.
Based on our data analysis, we recommend that life insurance companies offer lower premiums to European citizens because of their high life expectancy, and moderate premiums to citizens of North American countries, with similar premiums for South American countries. Premiums in Oceania's developing countries should be higher than in Oceania's developed countries because of the variation in life expectancy. Populations of African countries should be charged substantially higher premiums because of their much lower life expectancy relative to other regions. Using this analysis, insurance companies should be able to optimize their profit and minimize lost revenue.